[Prototype] Integration with PipelineRL: streaming dataset and trainer events with weight broadcasting #389
Conversation
…M into denis/new_datasets
super()._validate()
rank, global_ranks = self._get_model_and_sequence_data_rank_and_global_ranks()
We don't care about PP, so this is the same as tensor_and_sequence_data, right?
Yes, but you cannot easily calculate global ranks for tensor_and_sequence_data in some configuration
variations, right? In that case, calculating them via simulation might actually be a good approach, even for
tensor_and_sequence_data.
Also, are we dropping pipeline parallelism entirely? If not, would it be better to use model_and_sequence_data?
The ranks for tensor_and_sequence_data are easy to calculate https://github.com/ServiceNow/Fast-LLM/pull/389/files#diff-c76c17be20bd8a5658b40b2b9301c3fce5d6c06cf5341a8767f9339e36378f90R358. model_and_sequence_data has a slightly more complicated computation because it has two strides, but let's not worry about it.
I've seen your latest change in the distributed setup. Thanks, I agree it's possible. However, if we want to change the arrangement, we'll need to modify the ranks-and-strides function again. With the simulation, you only change the assigned ranks, and the groups are created automatically.
Why would we want to change something? For example, I saw in one configuration that we have more than one data group on the same node, while its tensor and pipeline parallel parts are placed on another node. As a result, we end up with two streaming dataset readers on one node and none on another.
Ideally, we would want to have one data group per node, with its TP and PP parts on the same node as much as possible. At the very least, we should avoid having two data-group leaders on the same node. What do you think?
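The two layouts being compared above can be sketched as plain rank arithmetic. This is an illustrative sketch only: the function names are hypothetical, and the assumption that tensor_and_sequence_data needs a single stride while model_and_sequence_data needs two (because pipeline ranks add an outer stride) is taken from the comments in this thread, not from Fast-LLM's actual implementation.

```python
def single_stride_group(first: int, size: int, stride: int) -> list[int]:
    # One-stride layout, e.g. tensor_and_sequence_data:
    # group members are evenly spaced global ranks.
    return [first + i * stride for i in range(size)]


def two_stride_group(
    first: int,
    inner_size: int,
    inner_stride: int,
    outer_size: int,
    outer_stride: int,
) -> list[int]:
    # Two-stride layout, e.g. model_and_sequence_data: an inner
    # (tensor/sequence-data) stride nested inside an outer
    # (pipeline) stride.
    return [
        first + o * outer_stride + i * inner_stride
        for o in range(outer_size)
        for i in range(inner_size)
    ]
```

For example, with 2 pipeline stages spaced 8 ranks apart and 2 inner ranks each, `two_stride_group(0, 2, 1, 2, 8)` yields `[0, 1, 8, 9]`, which is the "slightly more complicated computation" mentioned above.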
@@ -0,0 +1,105 @@
import logging

import orjson
Any good reason why we need this over the stock json?
It is used in PipelineRL because it is much faster; I decided to use the same library on our side.
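For context, the main practical difference is that `orjson.dumps` returns `bytes` rather than `str`. A small sketch of keeping the two interchangeable (the wrapper names `dumps`/`loads` are hypothetical, not from this PR):

```python
import json

try:
    import orjson  # C-accelerated JSON, used by PipelineRL

    def dumps(obj) -> bytes:
        # orjson serializes directly to bytes.
        return orjson.dumps(obj)

    def loads(data):
        return orjson.loads(data)

except ImportError:
    # Stdlib fallback with the same round-trip semantics, just slower.
    def dumps(obj) -> bytes:
        return json.dumps(obj).encode()

    def loads(data):
        return json.loads(data)


record = {"event": "weights_ready", "step": 100}
assert loads(dumps(record)) == record
```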
Superseded by #428: refactoring and simplification of the prototype, along with changes to distributed code and tests.
✨ Description
This PR provides the initial integration with PipelineRL.
It introduces:
- Trainer events `training_finished`, `initial_weights_step` and `weights_ready`, broadcast over a dedicated Redis channel.

This enables seamless coordination between Fast-LLM training and PipelineRL-based inference or orchestration components.
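The event mechanism can be sketched as a small publisher. The event names come from this PR's description, but the channel name, payload schema, and helper functions below are assumptions for illustration, not Fast-LLM's actual API:

```python
import json

# Event names taken from the PR description; the payload layout is assumed.
TRAINER_EVENTS = ("training_finished", "initial_weights_step", "weights_ready")


def make_event(name: str, step: int) -> bytes:
    # Build the serialized message for one trainer event.
    if name not in TRAINER_EVENTS:
        raise ValueError(f"unknown trainer event: {name}")
    return json.dumps({"event": name, "step": step}).encode()


def publish_event(channel: str, name: str, step: int) -> None:
    # redis-py's publish() delivers the message to all current
    # subscribers of the channel (fire-and-forget pub/sub).
    import redis  # imported lazily so the sketch loads without redis-py

    redis.Redis().publish(channel, make_event(name, step))
```

A subscriber on the PipelineRL side would SUBSCRIBE to the same channel and, for example, trigger a weight reload when it sees `weights_ready`.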
Closes #
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing
Performance Impact
📊 Performance Impact Details
If there is any impact on performance, describe it and provide benchmark results, if applicable:
🗒️ Additional Notes
Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.